feat(eval): seed hand-graded golden cases so the benchmark can run (GEPA)#660

Open
Victor "David" Medina (Victor-David-Medina) wants to merge 1 commit into main from claude/gepa-golden-seed

@Victor-David-Medina

Collaborator

GEPA benchmark: seed the hand-graded golden cases so it can actually run

Check before build. The scoped task was "build the GEPA golden-set (recovery-draft + morning-brief)." Reading the eval system first showed that building it would be pure duplication, because the golden set already exists:

  • lib/eval/golden-cases-recovery-v1.ts - 56 graded cases across all 4 revenue workflows (lapsed-winback, estimate-recovery, review-lift, slot-rescue), each with an ideal reference_verdict + graded dimensions.
  • lib/eval/golden-cases-v1.ts - 50 graded cases including morning-brief, churn-prediction, isa-routing, content, council.

The real gap (a genuine "non-working" section)

Those 106 cases are imported by nothing except the lib/eval barrel. The eval/PGR benchmark reads golden cases from the evaluation_golden_cases Supabase table (golden-dataset.ts), and scripts/run-first-eval.ts bails immediately:

if (!cases || cases.length === 0) { console.log("No active golden cases found. Seed cases first."); process.exit(0); }

There is no seed that bridges the static constants into that table. The graded content was orphaned from the runtime path - the benchmark could not run at all.

The fix (wire it live, don't rebuild)

  • lib/eval/golden-seed.ts - pure source: getSeedGoldenCases() (combines both sets) + toGoldenCaseInsert() (maps a static GoldenCase to a DB row; drops the client string id since the table auto-generates a UUID, drops timestamps). No IO, unit-testable.
  • scripts/seed-golden-cases.ts - idempotent seed: upsert-by-title (safe to re-run, never deletes), --dry-run preview, founder-gated on prod (writes to Supabase).
  • __tests__/golden-seed.test.ts - pure (no DB): the combined set is complete, titles are unique (title is the dedup key), the graded workflows are covered (including morning-brief), each case maps cleanly to the insert row, and every case is actually graded (not schema-only).
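A minimal sketch of the pure mapping and the checks described in the list above. This is not the code in the PR: the `GoldenCase` shape is an assumption (only the string id, the title, `reference_verdict`, graded dimensions and timestamps are mentioned here; `workflow` and the exact field types are hypothetical), and `findDuplicateTitles` and `isGraded` are illustrative helper names.

```typescript
// Hypothetical shape of a static golden case; the real type lives in lib/eval.
interface GoldenCase {
  id: string; // client-side string id, dropped on insert (the table generates a UUID)
  title: string; // used as the dedup key by the seed
  workflow: string; // assumed field name
  reference_verdict: string;
  dimensions: Record<string, number>; // assumed representation of graded dimensions
  created_at?: string;
  updated_at?: string;
}

// Row sent to the table: everything except the id and the timestamps.
type GoldenCaseInsert = Omit<GoldenCase, "id" | "created_at" | "updated_at">;

// Pure mapper: no IO, so it can be unit-tested without a database.
function toGoldenCaseInsert(c: GoldenCase): GoldenCaseInsert {
  return {
    title: c.title,
    workflow: c.workflow,
    reference_verdict: c.reference_verdict,
    dimensions: c.dimensions,
  };
}

// Returns each title that appears more than once (reported once per title).
function findDuplicateTitles(cases: GoldenCase[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.title)) dupes.add(c.title);
    seen.add(c.title);
  }
  return Array.from(dupes);
}

// "Graded, not schema-only": a verdict is present and at least one dimension is scored.
function isGraded(c: GoldenCase): boolean {
  return c.reference_verdict.trim().length > 0 && Object.keys(c.dimensions).length > 0;
}
```

Keeping the mapper free of IO is what lets the test file stay DB-free: the uniqueness and "actually graded" checks run over plain arrays.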

After merge: npx tsx scripts/seed-golden-cases.ts → npx tsx scripts/run-first-eval.ts → the "we grade ourselves" PGR benchmark is live. To be honest about proof_events=0: golden cases are representative fixtures that grade draft quality, not recovered dollars.
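Why re-running the seed is safe can be illustrated with a small pure planner. This is a sketch under stated assumptions, not the script in this PR: `planSeed` and `describePlan` are hypothetical names, the existing titles are passed in as a plain array instead of being read from Supabase, and the actual write is left out.

```typescript
// Minimal row shape for the sketch; only the dedup key matters here.
interface SeedRow {
  title: string;
}

interface SeedPlan<T> {
  toInsert: T[]; // titles not yet in the table
  toUpdate: T[]; // titles already present (upsert overwrites them in place)
}

// Partition rows by whether their title already exists. Nothing is ever deleted.
function planSeed<T extends SeedRow>(existingTitles: readonly string[], rows: readonly T[]): SeedPlan<T> {
  const existing = new Set(existingTitles);
  const plan: SeedPlan<T> = { toInsert: [], toUpdate: [] };
  for (const row of rows) {
    if (existing.has(row.title)) {
      plan.toUpdate.push(row);
    } else {
      plan.toInsert.push(row);
    }
  }
  return plan;
}

// One-line summary; in dry-run mode it only describes what a real run would do.
function describePlan<T extends SeedRow>(plan: SeedPlan<T>, dryRun: boolean): string {
  const prefix = dryRun ? "[dry-run] would" : "will";
  return `${prefix} insert ${plan.toInsert.length}, update ${plan.toUpdate.length}, delete 0`;
}
```

Under this model a second run against an already-seeded table plans zero inserts, which is the property that makes the seed idempotent.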

Generated with Claude Code by RelayLaunch

…n (GEPA)

Check-before-build found the golden set ALREADY EXISTS (106 hand-graded cases:
GOLDEN_CASES_V1 + GOLDEN_CASES_RECOVERY_V1, covering the 4 revenue workflows +
morning-brief/churn/isa), so building new golden cases would be pure duplication.

The REAL gap: those cases live only as static TS constants, imported by nothing
but the barrel. The eval/PGR benchmark reads from the evaluation_golden_cases
Supabase table, and run-first-eval.ts bails with 'No active golden cases found.
Seed cases first.' The graded content was orphaned from the runtime path - the
benchmark could not run.

This bridges them: lib/eval/golden-seed.ts (pure source + DB-row mapper) +
scripts/seed-golden-cases.ts (idempotent upsert-by-title, --dry-run, founder-gated
on prod). Now seed once -> run-first-eval establishes PGR baselines -> the
'we grade ourselves' benchmark is live. Honest at proof=0: golden cases are
representative fixtures grading draft QUALITY, not recovered dollars.

Tests (pure, no DB): combined set complete + unique titles (the dedup key) +
covers the graded workflows + maps cleanly to the insert row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 2, 2026


🛡️ Cascade Quality Score: 100/100

| Category | Score | Status |
| --- | --- | --- |
| TypeScript | 20/20 | |
| ESLint | 20/20 | |
| Brand Compliance | 15/15 | |
| Test Suite | 25/25 | |
| Build | 20/20 | |

Threshold: 85/100 | Result: PASS ✅
